ABSTRACT. An explicit understanding of the opportunity for
constructing  new  algorithms  out  of  existing  (or  supposedly  existing) 
algorithms  is  presented. 

Say that $B, A_1, A_2, \ldots$ are problems. We present an abstract
setting that provides for the effective use of algorithms for problems
$A_1, A_2, \ldots$ for the design of efficient algorithms for problem $B$. A
notion  of  "lucid  boxes"  which  is  an  extension  of  "black  boxes"  is 
introduced  for  this  purpose. 

We  exemplify  the  applicability  of  these  lucid-box  compositions  (or 
reducibilities)  for  design  and  specification  of  efficient  algorithms 
and  when  multi-parameter  complexity  optimization  is  required. 
Multi-parameter optimization is typical of parallel and distributed
computation environments, where there is a need to optimize
simultaneously: time, sizes of local memories, communication load on
many lines, etc.

1. Introduction

Every existing algorithm represents some knowledge that has been
acquired. It is the responsibility of theoreticians in the field to
explore every opportunity to utilize the knowledge accumulated so far
for the future design of algorithms. Often it is easier to use
existing efficient algorithms for another problem in related models of
computation than to design a completely new algorithm.

For  later  in  the  introduction  we  need  the  following.  It  is  well 
known  that  sequential  execution  of  a  computer  program,  say  P,  on  some 
input  I  can  be  described  similarly  to  a  proof  in  mathematical  logic. 
Associate a line with each step of this execution. A line that
corresponds to step $s$ ($s \ge 1$) includes the sequence V of all input and
program (including output) variables and their contents at the
beginning  of  the  step.  Next  to  it,  specify  the  (atomic)  instruction 
(with  respect  to  a  machine  or  a  programming  language)  which  is  executed 
in  step  s.  The  sequence  V  at  the  beginning  of  step  s+1  is  written  in 
the  next  line  which  is  associated  with  this  step. 
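The line-by-line description above can be made concrete with a minimal Python sketch. Everything named here (the toy instruction helpers `assign` and `jump_if`, the sample `PROGRAM`, and `run_with_trace`) is an assumption introduced for illustration and does not come from the paper. Each recorded line holds the step number, a snapshot of the variable sequence V at the beginning of the step, and the atomic instruction executed in that step.

```python
def assign(var, fn):
    """Atomic instruction: store fn(state) in `var`, then fall through."""
    def action(state, pc):
        state[var] = fn(state)
        return pc + 1
    return action


def jump_if(cond, target):
    """Atomic instruction: jump to `target` when cond(state) holds."""
    def action(state, pc):
        return target if cond(state) else pc + 1
    return action


# Toy program (hypothetical): total := n + (n - 1) + ... + 1.
PROGRAM = [
    ("total := 0",         assign("total", lambda s: 0)),
    ("if n = 0 goto 5",    jump_if(lambda s: s["n"] == 0, 5)),
    ("total := total + n", assign("total", lambda s: s["total"] + s["n"])),
    ("n := n - 1",         assign("n", lambda s: s["n"] - 1)),
    ("goto 1",             jump_if(lambda s: True, 1)),
]


def run_with_trace(program, variables):
    """Run `program` and return one line per step:
    (step number, snapshot of all variables before the step, instruction)."""
    state = dict(variables)
    trace, pc, step = [], 0, 1
    while pc < len(program):
        text, action = program[pc]
        trace.append((step, dict(state), text))  # V at the beginning of the step
        pc = action(state, pc)
        step += 1
    trace.append((step, dict(state), "halt"))    # final contents of V
    return trace
```

Calling `run_with_trace(PROGRAM, {"n": 2, "total": None})` yields the full sequence of lines; a black-box user would see only the last one.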

We  outline  a  methodological  framework  for  the  design  of  efficient 
algorithms  which  is  simple,  uniform  and  general.  It  provides  for  both: 
(1)  a  direct  design  of  a  procedure  from  atomic  instructions,  and  (2)  a 
build-up  of  a  composed  procedure  from  a  sequence  of  given  procedures. 
For direct design we suggest using known ways, while we introduce some
ideas for the composition of new procedures from other (specified or not
fully specified) procedures. It will be evident that other known ways
for composition of procedures can be derived efficiently from our
framework. Our framework takes full advantage of the line-by-line
execution description of existing programs given above. Therefore, we
coin the name "lucid boxes" for the way in which existing programs are
used. This is in sharp contrast to the concept of "black boxes", where
only predeclared outputs of existing programs are transparent to
procedures which are composed out of these existing procedures.

Examples for which the full computational power of our framework
is necessary are presented, thereby buttressing the need for this
framework. Actually, some of our more powerful examples arise
when we need to compose a new procedure out of hypothetically existing
procedures. Examples I, II and III demonstrate the applicability of our
framework for the design of efficient sequential, parallel and
distributed algorithms or for proving theorems about their existence.
Typically, such theorems express the performance of new
algorithms as functions of the performance parameters of other
algorithms. Multi-parameter optimization is often required in parallel
and distributed computation environments, where there is a need to
optimize simultaneously: time, sizes of local memories, communication
load on many lines, etc. It is conjectured that there is no way to
derive the same theorems using a "black-box build-up" of a composed
procedure from hypothetically existing ones.

We show that this framework also encompasses known techniques for
reducibilities among problems. The terms 'reducibility' and
'composition' are used to refer to the same operation. The use of
either term relates to the significance of the result rather than to
the result itself.

In the present paper we say that a serial algorithm for a problem
is efficient if its running time is bounded by the order of a "low"
degree polynomial in the length of the input and there is no known
algorithm that achieves a better asymptotic time bound for the same
problem. This is different from the weaker (less pragmatic but more
theoretically robust) notion of efficient algorithms as used in the
theory of NP-completeness, where any polynomial time algorithm is
defined to be efficient. While we give evidence that lucid-box
compositions would probably not affect the theories that employ this
weaker definition of efficient algorithms, they will certainly have an
impact on more pragmatic directions in the design of algorithms (as
implied by our examples), where our definition of efficient algorithms
is more appropriate.

It is interesting to note that lucid-box compositions actually
suggest an answer to a problem raised implicitly by Aho, Hopcroft,
and Ullman [AHU]. They give a non-standard definition of
NP-completeness ("A language $L_0$ is NP-complete if the following
condition is satisfied: If we are given a deterministic algorithm of
time complexity $T(n) \ge n$ to recognize $L_0$, then for every language L
in NP we can effectively (underlined by the author) find a
deterministic algorithm of time complexity $T(p_L(n))$, where $p_L$ is a
polynomial that depends on L"). They do not specify in the text what
is meant by an 'effective' use of an algorithm for the purpose of
obtaining another algorithm. We propose an explicit understanding of
the term 'effectively' in this definition of NP-completeness. (Not for
polynomial time reducibilities only.)

We  would  like  to  say  that  this  paper  does  not  belong  to  the 
conventional  specific  field  of  Programming  Languages.  On  one  hand  we 
point at an apparent deficiency of conventional programming
methodologies and suggest a wide enough framework to overcome this
deficiency. Still there is much additional work to be done in order to
adapt  our  framework  to  various  existing  programming  methodologies  and 
languages.  Section  4  may  be  viewed  as  a  first  (very  small)  step  in 
this  direction. 

The  simple  example  that  follows  points  at  unsatisfactory  features 
of  the  commonly  used  "black-box-techniques"  for  compositions  of 
procedures  since  they  seem  not  to  cope  well  with  the  problem  of 
specifying  the  algorithm  being  described.  It  also  illustrates  some  of 
the  more  formal  definitions  of  the  next  section. 

An  introductory  example 

A  similar  example  to  the  one  which  is  presented  below  was  given  in 
[Meg2]  for  other  purposes.  More  on  the  family  of  examples  which  is 
represented  by  this  example  can  be  found  in  Example  III.  I  find  this 
example  both  simple  and  intriguing. 

Let $f_i(\lambda) = a_i + \lambda b_i$, $i = 1, \ldots, n$, be pairwise distinct
increasing functions of $\lambda$ ($b_i > 0$). For every $\lambda$, let $F(\lambda)$ denote the
median of the set $\{f_1(\lambda), \ldots, f_n(\lambda)\}$. Obviously, $F(\lambda)$ is a piecewise
linear monotone increasing function with $O(n^2)$ break points. Given $\lambda$,
$F(\lambda)$ can be evaluated in $O(n)$ time [AHU] (once the $f_i(\lambda)$'s have been
computed). The parametrized median-problem: Solve the equation $F(\lambda) = 0$.
One possible way of solving this problem is to first identify the
set of intersection points $I = \{\lambda_{ij} : a_i + \lambda_{ij} b_i = a_j + \lambda_{ij} b_j \ (i \ne j)\}$.
Every breakpoint of F is equal to one of the $\lambda_{ij}$'s. We can thus search
the set $I \cup \{-\infty, \infty\}$ for two values $\lambda^1$, $\lambda^2$ such that
$F(\lambda^1) \le 0 \le F(\lambda^2)$ and such that there is no $\lambda_{ij}$ in the open
interval $(\lambda^1, \lambda^2)$. Once $\lambda^1$ and $\lambda^2$ have been found, we can solve the
equation readily since F is linear over $[\lambda^1, \lambda^2]$. The search for such
an interval requires $O(\log n)$ F-evaluations and (if we do not wish to
presort the set of $\lambda_{ij}$'s) finding medians of subsets of the set of
$\lambda_{ij}$'s whose cardinalities are $n^2, n^2/2, n^2/4, \ldots$. Thus, such an
algorithm runs in $O(n^2)$ time, which is dominated by the evaluation of
all the intersection points.
An alternative approach is as follows. Let us start the
evaluation of $F(\lambda)$ with $\lambda$ not specified. Assume that each of the
branching decisions of this n-number median computation is based on a
comparison between two input numbers only. Denote the solution of the
equation $F(\lambda) = 0$ (which is of course not known at this point) by $\lambda^*$.
The outcome of the first comparison, performed by the algorithm for
evaluating F, depends of course on the numerical value of $\lambda$.
Specifically, if our median-finding algorithm for evaluating F starts
with comparing $f_1(\lambda)$ and $f_2(\lambda)$, then the intersection point $\lambda_{12}$ is a
critical value for this comparison; namely, for $\lambda \ge \lambda_{12}$, $f_1(\lambda) \ge f_2(\lambda)$,
while for $\lambda \le \lambda_{12}$, $f_1(\lambda) \le f_2(\lambda)$, or vice versa. Thus, we may find
$\lambda_{12}$, evaluate $F(\lambda_{12})$, and then decide whether $\lambda^* \ge \lambda_{12}$ or $\lambda^* \le \lambda_{12}$
according to the sign of $F(\lambda_{12})$. We can then proceed with the
evaluation of $F(\lambda)$, where $\lambda$ is still not specified but now restricted
either to $(-\infty, \lambda_{12}]$ or to $[\lambda_{12}, \infty)$. The same idea is repeatedly applied
at the following points of comparison. In general, when we need to
compare $f_i$ with $f_j$ and $\lambda$ is currently restricted to an interval
$[\lambda', \lambda'']$, then if $\lambda_{ij}$ does not lie in the interval then the outcome of
the comparison is uniform over the interval; otherwise, by evaluating
$F(\lambda_{ij})$, we can restrict our interval either to $[\lambda', \lambda_{ij}]$ or to $[\lambda_{ij}, \lambda'']$.

By the correctness of the median algorithm it can readily be seen
that we finally restrict ourselves to an interval in which $F(\lambda)$ is
linear.

Since $O(n)$ such comparisons are performed in the known
median-finding algorithm and each may require an evaluation of F (which
amounts to one median-finding), it follows that such an algorithm runs
in $O(n^2)$ time. Note that [Meg2] mentions a linear time algorithm for
this problem that uses a straightforward technique. However, Example
III lists references to many algorithms that use a similar technique
and improve on the time complexity of all their known "conventional"
counterparts.
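The alternative approach can be sketched with a Python generator playing the role of the suspended median algorithm. This is a hedged illustration only: `median_by_comparisons` is a hypothetical stand-in that uses insertion sort (quadratically many comparisons) instead of the linear-time selection assumed in the text, the lower median is taken for even n, and all names are assumptions of the example. The driver stops the selection before every comparison, does its side computation on the critical value, and resumes the selection with the outcome that holds at the unknown root.

```python
from fractions import Fraction
from statistics import median_low


def F(lines, lam):
    """Median of the values a + lam * b over all lines (a, b)."""
    return median_low(a + lam * b for a, b in lines)


def median_by_comparisons(n):
    """Comparison-only selection of the median index of n items.

    Yields a pair (i, j) to ask "is item i smaller than item j?" and
    expects the boolean answer to be sent back in.
    """
    order = []
    for i in range(n):
        pos = len(order)
        while pos > 0:
            less = yield (i, order[pos - 1])
            if not less:
                break
            pos -= 1
        order.insert(pos, i)
    return order[(n - 1) // 2]


def parametric_median_root(lines):
    """Solve F(lam) = 0 for lines (a, b) with integer a and integer b > 0."""
    lo = hi = None                    # open interval known to contain the root
    selection = median_by_comparisons(len(lines))
    try:
        i, j = next(selection)
        while True:
            (ai, bi), (aj, bj) = lines[i], lines[j]
            if bi == bj:
                less = ai < aj        # parallel lines never change order
            else:
                crit = Fraction(aj - ai, bi - bj)     # critical value
                if (lo is None or lo < crit) and (hi is None or crit < hi):
                    value = F(lines, crit)            # side computation
                    if value == 0:
                        return crit
                    if value < 0:
                        lo = crit     # the root lies to the right of crit
                    else:
                        hi = crit     # the root lies to the left of crit
                root_is_right = lo is not None and crit <= lo
                # To the right of crit the steeper line is the larger one.
                less = (bi < bj) if root_is_right else (bi > bj)
            i, j = selection.send(less)               # resume the selection
    except StopIteration as finished:
        a, b = lines[finished.value]  # this line is the median near the root
        return Fraction(-a, b)
```

The generator is never restarted: each `send` continues the selection exactly where it was suspended, which is the property the following discussion relies on.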

The  connection  to  the  earlier  discussion  is  as  follows.  We  want 
to  solve  the  parametrized  median  problem.  For  this  we  use  a  median 
algorithm  in  a  comparison  model  of  computation,  as  was  specified  above. 
It  should  be  evident  that  we  actually  make  a  very  strong  use  of 
information  which  is  available  from  the  aforementioned  line-by-line 
description  of  an  execution.  Namely,  before  each  comparison  we  stop 
the  actual  execution,  make  some  computations  (that  use  the  parameters 
of  the  comparison)  on  the  side;  and  then  proceed  (from  the  same  point) 
with  the  execution  of  the  median  algorithm  by  "feeding"  the  variables 
that  have  to  be  compared  so  as  to  affect  properly  the  result  of  the 
comparison.  So,  we  can  say  that  steps  of  the  execution  of  the  median 
algorithm  were  transparent  to  us  and  we  could  intervene  in  this 
execution  by  suspending  it  and  changing  contents  of  variables.  The 
suspension  part  is  similar  to  the  notion  of  coroutines  in  the  sense 
that  whenever  a  coroutine  is  activated,  it  resumes  execution  of  its 
program  at  the  point  where  the  action  was  last  suspended  (see  [K]).  We 
presented  this  example  for  a  simple  demonstration  of  the  following: 

•  there  is  information  which  is  inherent  in  the  execution  process   of 
any  "reasonable"  existing  algorithm; 

•  this  information  is  not  contained  in  usual  output  specifications  of 
algorithms; 

•  it  is  advantageous  to   use   this   information   as   soon   as   it   is 
available  rather  than  run  the  existing  algorithm  to  the  end. 

Remark.  Knuth  [K]  presents  examples  where  coroutines  are  used  for 
elegant  and  concise  (simulation  of)  system  modeling.  However, 
regarding  the  use  of  coroutines  for  algorithms  Knuth  asserts  that  "It 
is  rather  difficult  to  find  short,  simple  examples  of  coroutines  which 
illustrate  the  importance  of  the  idea".  It  seems  that  such  an  example 
has  been  found  here.  Of  course,  any  median-finding  algorithm  can  be 
modified  to  solve  the  parametrized  median  problem.  But  if  we  insist  on 
'locking  its  mechanism'  and  operating  only  on  its  output  then  the 
coroutine  notion  seems  to  imply  better  running  time  than  the  subroutine 
notion. The reason for this is as follows. Suppose that we define all
intermediate results and the complete execution to belong to the
output. We still need to perform the median algorithm all the way from
the beginning in order to resume operation from the point at which it was
actually aborted in the previous call, since otherwise we would not
know which comparison to perform next, as it depends on the results of
previous comparisons.
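The cost difference described in this remark can be checked with a small Python sketch. The routine below is a hypothetical stand-in for any algorithm that is interrupted before each of its steps; the names are assumptions of the example. Resuming a suspended routine executes each step once, while a subroutine that must be replayed from the beginning after every interruption executes 1 + 2 + ... + T steps in total.

```python
def interruptible_routine(steps, counter):
    """A routine of `steps` atomic steps that pauses after each one.

    `counter` is a one-element list used to count executed steps.
    """
    for k in range(steps):
        counter[0] += 1          # one atomic step of real work
        yield k                  # suspension point


def cost_when_resumed(steps):
    """Total steps executed when the routine is resumed after each pause."""
    counter = [0]
    for _ in interruptible_routine(steps, counter):
        pass
    return counter[0]


def cost_when_restarted(steps):
    """Total steps executed when every continuation replays from the start."""
    counter = [0]
    for reach in range(1, steps + 1):
        replay = interruptible_routine(steps, counter)
        for _ in range(reach):   # redo everything up to the last stop, plus one
            next(replay)
    return counter[0]
```

With `steps = 10` the first function counts 10 executed steps and the second counts 55, the sum 1 + 2 + ... + 10.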

The next section demonstrates a precise characterization of the
lucid-box composition technique in some (arbitrarily chosen)
computational environment. Section 3 relates the technique to known
concepts of reducibilities and discusses implications it might have.
Section 4 examines a possible structured-programming description of
applications of the lucid-box technique. Examples I, II, and III
demonstrate the use of this technique for efficient distributed,
parallel and serial computation. We preferred to put the examples at
the end in order to enable an uninterrupted discussion of the features
of the lucid-box technique.

2. Lucid-box Compositions of Procedures.

We  sketched  the  notion  of  lucid-box  compositions  in  the  previous 
section.  A  precise  definition  of  this  notion  in  a  specific 
random-access-machine  (RAM)  environment  is  given  in  the  present 
section. We hope that this definition, in conjunction with the examples
throughout the paper, will guide the reader to a proper interpretation
of lucid-box compositions in computational environments other than the
one described in this section. This is the main goal of this section.

We try to give a self-contained presentation. However, in a few
cases the reader will be referred to [AHU]. As a means of presentation
we specify places in this book where 'patches' are proposed. The
reader  is  assumed  to  be  familiar  with  the  contents  of  sections  1.2, 
1.3,  1.4  and  1.8  where  the  random  access  machine  (RAM),  the 
random-access-stored-program-machine  (RASP)  and  Pidgin  ALGOL  are 
introduced. Unlike programs in conventional programming languages, Pidgin ALGOL
programs are meant to be read by a human reader rather than a machine.
Pidgin ALGOL permits a succinct presentation of algorithms in this
book. A Pidgin ALGOL program can be translated into a RAM or RASP
program in a straightforward manner. It is necessary to consider the time
and space required to execute the code corresponding to a Pidgin ALGOL
statement on a RAM or RASP. Later we say how to extend Pidgin ALGOL in
order to include lucid-box composition of procedures. For the sake of
the  current  presentation  we  introduce  a  machine  which  is  slightly 
different  from  both  the  RAM  and  RASP.  The  changes  relative  to  the  RAM 
in  the  definition  of  this  machine  reflect  proposed  changes  in  the 
definition  of  the  extended  Pidgin  ALGOL  with  respect  to  Pidgin  ALGOL; 
thereby,  implying  how  to  translate  statements  of  the  extended  Pidgin 
ALGOL into this machine. We call this new machine the
random-access-memory-and-program machine (RAMP). The only difference
between  the  RAMP  and  the  known  RAM  is  that  the  program  is  located  in  a 
read-only  memory.  Recall  that  the  indirect  read  instruction  'READ  *i' 
means:  "copy  the  register  whose  number  is  the  content  of  register  i 
into  the  accumulator".  The  'READ  *i'  instruction  of  our  machine  may 
refer  also  to  locations  (registers)  in  the  program  part  of  the  machine. 
(The  instructions  are  encoded  by  integers.  For  an  example  of  a  similar 
encoding  see  the  presentation  of  the  RASP  in  [AHU].)  This  book  shows 
that  the  order  of  time  (and  space)  complexities  of  the  RAM  and  RASP  are 
the  same  for  the  same  algorithm  if  the  cost  of  instructions  is  either 
uniform  or  logarithmic.  No  new  ideas  are  required  in  order  to 
establish  similar  relationships  between  the  RAM  (or  RASP)  and  the  RAMP. 

Let $P_1, P_2, \ldots$ be existing programs which are written for a RAMP.
We  present  a  framework  for  specifying  a  new  program  P  by  using  these 
programs.  We  do  it  in  two  stages.  First,  we  specify  P  for  a  model  of 
computation  which  contains  the  RAMP.  Later,  we  show  a  possible  way  of 
translating  P  into  the  RAMP.   This  may  clarify  the  choice  of  the  RAMP. 

The  model  of  computation  for  which  P  is  defined  contains  the  RAMP 
in the following sense. It employs a sequence of RAMPs $R_0, R_1, \ldots$.
The main program for P is located in $R_0$. This program may be
constructed out of the usual (optionally) labeled RAMP instructions
with respect to $R_0$. In addition the RAMPs $R_1, R_2, \ldots$ are attached to
$R_0$ in a "slave-master" relation. Any one of $P_1, P_2, \ldots$ may be run on
each of these RAMPs in the usual way with one exception:

The RAMP $R_i$, $i \ge 1$, has a distinguished additional square called
"the dominator of $R_i$" ($d(R_i)$ for short) that enables $R_0$ to control its
operation. The program $P_j$ that runs on $R_i$ is changed so that the following
pair of instructions must be executed before every access to any of its
instructions.

1. Enter 'red' into $d(R_i)$.

2. Proceed only if $d(R_i)$ contains 'green'. (Only $R_0$ may enter 'green'
into $d(R_i)$ and $R_i$ has to wait until $R_0$ does so.)

The  program  for  Rq  may  also  have  instructions  for: 

1. Performing all the RAMP instructions with respect to any of the
input, memory or output squares of each $R_i$ (with respect to the
accumulator of $R_0$).

2. Starting any $P_j$ on any $R_i$.

3. Entering 'green' into $d(R_i)$, $i \ge 1$.

4. Reading the $P_j$ instructions, and especially the next instruction to
be executed, into the accumulator of $R_0$.

We have already made our point in the previous section regarding
how to relate this composition scheme to our introductory example. Our
procedure starts the median algorithm on $R_1$. Whenever a comparison
between two (copies of) inputs of the median algorithm is going to take
place, our procedure does the following. It 'suspends' the median
algorithm, makes all the side computations required, determines the
result of the comparison by assigning fictitious values to the
corresponding locations of $R_1$ and resumes the operation of the median
algorithm.
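The master-slave scheme with a dominator square can be imitated in a few lines of Python. The sketch below is only a toy model under loudly stated assumptions: `SlaveRAMP`, its three-instruction repertoire (`jless`, `copy`, `jump`), the `master` loop and the `force_less` policy are all hypothetical names invented for the example, and a generator suspension stands in for "wait until the master enters 'green'".

```python
class SlaveRAMP:
    """Toy slave machine whose every instruction is guarded by a dominator."""

    def __init__(self, program, memory):
        self.program = tuple(program)    # read-only program part
        self.memory = dict(memory)
        self.pc = 0                      # location counter
        self.dominator = "green"

    def run(self):
        while self.pc < len(self.program):
            self.dominator = "red"       # 1. enter 'red' into the dominator
            yield                        # 2. wait for the master
            assert self.dominator == "green"
            op, *args = self.program[self.pc]
            if op == "jless":            # jump to target if mem[a] < mem[b]
                a, b, target = args
                taken = self.memory[a] < self.memory[b]
                self.pc = target if taken else self.pc + 1
            elif op == "copy":
                dst, src = args
                self.memory[dst] = self.memory[src]
                self.pc += 1
            elif op == "jump":
                self.pc = args[0]
            else:
                raise ValueError(f"unknown instruction {op!r}")


def master(slave, before_instruction):
    """Run the slave, inspecting (and possibly patching) it before each step."""
    log = []
    for _ in slave.run():                        # slave waits, dominator is red
        instruction = slave.program[slave.pc]    # read the next instruction
        log.append(instruction)
        before_instruction(instruction, slave.memory)  # access slave memory
        slave.dominator = "green"                # let the slave proceed
    return log


def force_less(instruction, memory):
    """Master policy: make every comparison come out as 'less'."""
    if instruction[0] == "jless":
        _, a, b, _ = instruction
        memory[a], memory[b] = 0, 1              # fictitious values


# out := y if the compared copies satisfy cx < cy, otherwise out := x.
SELECT = [("jless", "cx", "cy", 3), ("copy", "out", "x"),
          ("jump", 4), ("copy", "out", "y")]
MEMORY = {"x": "f1", "y": "f2", "cx": 0, "cy": 0, "out": None}
```

Running `master(SlaveRAMP(SELECT, MEMORY), force_less)` steers the slave's comparison from outside without changing its program, which is the kind of intervention the introductory example needs.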

Let  us  overview  a  possible  way  for  mapping  our  composed  procedure 
into  a  single  RAMP  denoted  S. 

1. Proper versions of the main program for P and the programs
$P_1, P_2, \ldots$ are located in the program part of S.

2. An easy-to-compute mapping from the input, memory, output and
location-counter locations of $R_1, R_2, \ldots$ and the memory and
location-counter locations of $R_0$ into the memory of S has to be
defined. Such a mapping has to use as little memory space of S as
possible. We present one possible mapping. It seems to be especially
appropriate for the (difficult) case where the number of RAMPs which
are actually employed (denoted by n) and the maximum size of a RAMP's
memory over all their uses (denoted by m) are not known in advance.
Some easier cases may readily result in more efficient mappings. Let
the serial number $N(i,j)$ of location $j$ of RAMP $R_i$ ($i \ge 1$, $j \ge 0$) be the
cardinality of the set

$$\{(\ell,k) \mid ((\ell + k < i + j) \text{ OR } (\ell + k = i + j \text{ AND } k < j)) \text{ AND } \ell \ge 1 \text{ AND } k \ge 0\}.$$

(Intuitively, the array of pairs $(\ell,k)$, $\ell \ge 1$ and $k \ge 0$, is sorted in a
lexicographic order; first by the (finite length) diagonal that crosses
a pair and second by the serial number of this pair on this diagonal.)
So, we map location $j$ of RAMP $R_i$ into memory location $N(i,j)+c$ of RAMP
S, where c is some constant. This guarantees that S uses $O((n+m)^2)$
memory locations.
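The serial number defined above has a simple closed form: the diagonals before the one through $(i,j)$ contribute $1 + 2 + \cdots + (i+j-1)$ pairs, and $j$ more pairs precede $(i,j)$ on its own diagonal. The Python sketch below (function names are assumptions of the example) computes $N(i,j)$ both ways so the closed form can be checked against the set definition.

```python
def serial_number(i, j):
    """Closed form of N(i, j) for i >= 1, j >= 0."""
    if i < 1 or j < 0:
        raise ValueError("need i >= 1 and j >= 0")
    d = i + j                       # index of the diagonal through (i, j)
    return (d - 1) * d // 2 + j     # earlier diagonals, then position on it


def serial_number_by_counting(i, j):
    """N(i, j) computed directly as the cardinality of the defining set."""
    d = i + j
    return sum(1
               for l in range(1, d + 1)      # l >= 1
               for k in range(0, d)          # k >= 0
               if l + k < d or (l + k == d and k < j))
```

Since the pairs on the first D diagonals receive exactly the numbers 0 through D(D+1)/2 - 1, n machines with at most m locations each occupy fewer than $(n+m)^2$ addresses.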

3. The transition of control, which is done by updating the $d(R_i)$'s,
can be readily simulated. For this, 'activate' the location of S which
corresponds to the location-counter of $R_i$.

It should be clear that if the number of steps in the 'source'
composed procedure is T, then the number of steps in the translation
into the single RAMP is $O(T)$.

Let  us  go  back  to  the  extended  Pidgin  ALGOL.  Procedures  can  now 
be  invoked  by  either  calling  them  for  the  first  time  or  resuming  their 
operation.  Their  complete  execution  (in  the  way  it  is  described  in  the 
introduction)  can  be  used. 

Examples  I,  II  and  III  demonstrate  more  instances  where  this  power 
of  composition  of  procedures  is  useful.  The  reader  is  advised  to  read 
them  at  this  stage.  [V2]  contains  examples  for  applications  of  the 
lucid-box  composition  technique  in  synchronous  parallel  and  distributed 
computation. One of these examples is summarized in Example I. Example
II reviews another application for synchronous distributed computation
taken from [V1]. A lucid-box composition that uses sorting and merging
networks is given. Example III includes references for further
applications in parametrized computing. An application in asynchronous
distributed computation can be found in [V3]. In order to keep this
presentation within reasonable length we avoid saying more about it.


3. Relations with Other Notions of Composition.

We first refer to two reducibilities which are commonly used.
Definitions of widely used terms are sometimes omitted. They can be
found in [GJ]. The most popular technique in the literature for
showing  that  an  algorithm  for  one  decision  problem  can  be  used  for  a 
solution  of  another  decision  problem  uses  transformation 
reducibilities.  That  is,  a  constructive  transformation  that  maps  any 
instance  of  the  first  problem  into  an  equivalent  instance  of  the  second 
is  given.  Such  a  transformation  enables  us  to  convert  any  algorithm 
for  the  second  problem  into  a  corresponding  algorithm  for  the  first 
problem.  The  well  established  notion  of  composition  of  functions 
suggested  the  composition  of  a  function  on  an  existing  one,  thereby 
creating  a  new  function.  This  is  somewhat  similar  to  the  way 
transformation  reducibilities  are  defined.  However,  the  fact  that  an 
execution  of  an  existing  algorithm  may  include  much  more  information 
than  its  output  is  actually  ignored  by  the  transformation  reducibility. 
On the other hand, the extensive applicability of this reducibility
suggests  that  in  spite  of  its  narrowness  it  often  focuses  on  the  right 
things. 

The  known  generalization  of  transformation  reducibility  (see  [GJ]) 
strengthens  our  point  of  similarity  to  composition  of  functions  even 
more.  A  Turing  reduction  from  one  (search  or  decision)  problem  to 
another  is  an  algorithm  that  solves  the  first  problem  by  using  a 
hypothetical  subroutine  for  the  second  problem.  This  subroutine  can  be 
called  more  than  once.  Each  time  the  subroutine  is  called  it  operates 
(from  the  point  of  view  of  the  algorithm)  as  the  function  it  realizes; 
i.e.,  the  input  for  an  application  of  this  subroutine  is  written  by  the 
algorithm  in  some  specified  memory  location  and  then  the  subroutine 
responds  by  writing  the  output  in  some  specified  memory  location.  A 
similar  notion  of  operation  is  sometimes  referred  to  as  'black  boxes'. 
Polynomial  time  transformation  reducibility  is  often  called 
"Karp-reducibility"  while  polynomial  time  Turing  reducibility  is  often 
called  "Cook-reducibility".  See  Section  5.2  in  [GJ]  for  a  history  of 
terminology. 
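A standard instance of a Turing reduction in the black-box sense is the search-to-decision reduction for satisfiability, sketched below in Python. It is offered only as an illustration and is not taken from the paper: the function names are assumptions, clauses are lists of nonzero integers (v for a variable, -v for its negation), and the brute-force `sat_decision` merely stands in for the hypothetical subroutine. The reduction sees nothing but the yes/no answers of the oracle, which it calls once per variable plus once at the start.

```python
from itertools import product


def sat_decision(clauses, num_vars):
    """Black-box oracle (brute force here): is the CNF formula satisfiable?"""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def sat_search(clauses, num_vars, oracle=sat_decision):
    """Find a satisfying assignment using only yes/no answers of `oracle`.

    Returns a list of booleans (one per variable) or None if unsatisfiable.
    """
    if not oracle(clauses, num_vars):
        return None
    current = [list(clause) for clause in clauses]
    assignment = []
    for v in range(1, num_vars + 1):
        trial = current + [[v]]          # unit clause forcing variable v true
        if oracle(trial, num_vars):
            assignment.append(True)
            current = trial
        else:                            # then v must be false in any solution
            assignment.append(False)
            current = current + [[-v]]
    return assignment
```

Nothing about how the oracle reached its answers is available to `sat_search`; a lucid-box reduction would additionally be allowed to watch and steer the oracle's execution.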

We have already implied that we can alternate freely between
"compositions" and "reducibilities". Thus, we can use lucid-box
compositions for reducibilities between procedures. Since we called
the composition "lucid-box composition" we call the reducibility
"lucid-box reducibility".

Transformation  and  Turing  reducibilities  are  instances  of 
lucid-box  reducibilities  since  they  permit  us  to  define  reducibilities 
among  sets  of  solutions  for  problems.  Unlike  these  reducibilities 
lucid-box  reducibilities  permit  us: 

(1)  To  restrict  sets  of  solutions. 

In  our  examples  we  demonstrated  how  useful  it  might  be  to  restrict 
ourselves  to  comparison  models  of  computation,  or  to  force  the  copying 
assumption  or  to  deal  with  comparison  networks  only. 

(2)  To  use  the  full  information  in  the  execution  of  procedures  and  not 
only  their  output. 

(3)  To  relate  closely  the  resource  complexity  of  the  new  procedure  to 
the  resource  complexity  of  the  procedures  which  are  used  by  the 
lucid-box  reducibility. 

Our examples demonstrate the applicability of lucid-box compositions or
reducibilities for the design and specification of efficient algorithms and
in cases where multi-parameter complexity optimization is required. As
we already implied in the examples, multi-parameter optimization is
typical of parallel and distributed computation environments, where
there is a need to optimize simultaneously: time, sizes of local memories,
communication load on many lines, etc.

A  natural  question  to  be  asked  is  about  polynomial  time  lucid-box 
reducibilities;  and,  in  particular,  do  they  affect  the  theory  of 
NP-completeness?  Unlike  the  design  of  efficient  algorithms,  where  new 
opportunities  are  opened,  we  show  in  the  remainder  of  this  section  that 
the  answer  to  this  question  is  essentially  negative.  The  argument,  as 
we  shall  see,  is  simple.  It  is  based  on  the  following  observation. 
Given  input  variables  and  a  program  that  operates  on  them  it  is  an 
arbitrary  (semantical)  decision  which  program  variables  are  declared  to 
be  outputs. 

Extending  the  notion  of  lucid-box  reducibility  to  Turing  machines 
and  the  related  definitions  of  NP-completeness  is  straightforward  and 
therefore omitted. In the introduction we presented the definition of
[AHU] for NP-completeness. Obviously, every problem which is
NP-complete using Cook's reducibility is NP-complete using [AHU]'s
reducibility. [Ev] implies that it is an open question whether the other
direction holds as well. The following remark relates to the
definition of NP-completeness of [AHU], given above. Note that both
the program for recognizing $L_0$ and its execution are transparent to us
in a lucid-box reducibility from the problem of recognizing L to the
problem of recognizing $L_0$. This seems to imply that our definition of
NP-completeness is equivalent to that of [AHU], since nothing can stop us from
using in any "effective" way the program for recognizing $L_0$.

We would like to elaborate a little on the terminology which is being
used. [GJ] defines a search problem $\pi$ to consist of a set $D_\pi$ of finite
objects called instances and, for each instance $I \in D_\pi$, a set $S_\pi(I)$
of finite objects called solutions for I. An algorithm is said to solve
a search problem $\pi$ if, given as input any instance $I \in D_\pi$, it returns
the answer "no" whenever $S_\pi(I)$ is empty and otherwise returns some
solution s belonging to $S_\pi(I)$. Define, alternatively, a procedural
solution to $\pi$ to consist of an algorithm that solves $\pi$ including all
its intermediate computations. Assume that instead of algorithms that
solve search or decision problems we were interested in procedural
solutions of these problems. By definition, any call to a procedural
solution for some problem results in a listing of all its intermediate
computations throughout the T time units during which the procedural
solution ran (for some T). Right after the example in the
introduction, we remarked that a coroutine that runs in time $O(T)$ can
be simulated by a subroutine that runs in time $O(T^2)$ by restarting it
instead of resuming its operation from the place where it was stopped.

Therefore, if in a reduction of the procedural solutions of one problem to the procedural solutions of another problem we used only executions of solutions to the second problem, we can simulate it by a polynomial time Turing reduction. However, the (bizarre) possibility of using the program for the second problem in other ways than executing it still leaves the question of [Ev] "formally" open.
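The restart argument can be illustrated by a small sketch. It rests on assumptions of our own: the coroutine is modelled as a Python generator, one "step" is one yielded intermediate value, and the names `procedure`, `run_as_coroutine` and `run_by_restarting` are hypothetical and serve the illustration only.

```python
def procedure(n):
    """Toy procedure whose intermediate computations are the partial sums 1, 1+2, ..."""
    total = 0
    for i in range(1, n + 1):
        total += i
        yield total  # one step; the intermediate value is visible to the caller


def run_as_coroutine(n):
    """Resume the procedure after every step: T steps are executed in total."""
    steps, trace = 0, []
    for value in procedure(n):
        steps += 1
        trace.append(value)
    return trace, steps


def run_by_restarting(n):
    """Obtain the k-th intermediate value by restarting and re-running k steps."""
    steps, trace = 0, []
    for k in range(1, n + 1):
        gen = procedure(n)  # restart from scratch instead of resuming
        for _ in range(k):
            value = next(gen)
            steps += 1
        trace.append(value)
    return trace, steps
```

Both functions produce the same listing of intermediate values; the restarting version executes 1 + 2 + ... + T = T(T+1)/2 steps, which is the quadratic bound mentioned above.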

Remark. It might be interesting to define an "algebra of procedures" as follows. Its elements (the procedures) will be pairs, each consisting of a program and its input and program variables (output variables are not allowed). Lucid-box compositions will serve as the operations of this algebra since they allow for producing a procedure out of existing ones. This will be analogous to an algebra of algorithms that can be defined using black-box compositions. We do not elaborate on this idea here and leave it for future research.

4. On Structured Programming.

A  growing  number  of  programs,  documentation  of  programs  and 
presentations  of  algorithms  in  the  literature  use  principles  of 
structured  programming  for  modularity  and  clarity  of  exposition. 

A methodologically useful way to present an algorithm in this spirit is to start with a high-level description of the algorithm which includes the milestones in its performance. In a hierarchical fashion this overview is filled with details, until a complete, accurate low-level specification is obtained. For instance, take the
biconnectivity  algorithm  using  depth-first-search  (DFS)  in  [AHU] . 
After  presenting  the  DFS  method  for  searching  a  graph  (to  which  we 
refer  as  a  high-level  description)  this  book  presents  again  the  DFS  now 
intermixed  with  the  bookkeeping  required  for  the  biconnectivity 
algorithm.  An  alternative  approach  that  we  would  like  to  point  out 
looks  at  the  DFS  as  a  navigator  that  tells  us  where  to  go  next,  while 
there  is  a  bookkeeper  that  should  be  told  where  we  are,  in  order  to  be 
able  to  do  the  necessary  bookkeeping.  We  suggest  the  following 
description.  There  is  a  'general  manager'  that  coordinates  between  the 
navigator  ('president  of  the  company')  and  the  bookkeeper.  The  general 
manager  asks  the  president  where  to  go,  and  then  transmits  the  response 
to  the  bookkeeper.  The  bookkeeper  writes  down  whatever  is  required  and 
reports  to  the  general  manager  when  he  is  done.  Then  the  general 
manager  consults  the  president  again,  and  so  on. 
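The division of labor described above can be sketched as follows. This is an illustration under our own assumptions, not the presentation of [AHU]: the navigator is written as a Python generator over an undirected simple graph given as adjacency lists, the bookkeeper computes depth-first numbers, low values and articulation points, and all names (`dfs_navigator`, `BiconnectivityBookkeeper`, `general_manager`) are hypothetical.

```python
def dfs_navigator(graph, root):
    """Navigator: decides where to go next and knows nothing about bookkeeping."""
    visited = set()

    def visit(v, parent):
        visited.add(v)
        yield ("enter", v, parent)
        for w in graph[v]:
            if w not in visited:
                yield from visit(w, v)
            elif w != parent:
                yield ("back", v, w)  # non-tree edge to an already visited vertex
        yield ("leave", v, parent)

    yield from visit(root, None)


class BiconnectivityBookkeeper:
    """Bookkeeper: is told where we are and records what biconnectivity needs."""

    def __init__(self):
        self.dfn, self.low = {}, {}
        self.cut_vertices = set()
        self.root, self.root_children = None, 0

    def record(self, event):
        kind, v, other = event
        if kind == "enter":
            self.dfn[v] = self.low[v] = len(self.dfn)
            if other is None:
                self.root = v
        elif kind == "back":
            self.low[v] = min(self.low[v], self.dfn[other])
        elif kind == "leave" and other is not None:
            parent = other
            self.low[parent] = min(self.low[parent], self.low[v])
            if parent == self.root:
                self.root_children += 1
                if self.root_children >= 2:
                    self.cut_vertices.add(parent)
            elif self.low[v] >= self.dfn[parent]:
                self.cut_vertices.add(parent)


def general_manager(navigator, bookkeeper):
    """Ask the navigator where to go, pass the answer to the bookkeeper, repeat."""
    for event in navigator:
        bookkeeper.record(event)
    return bookkeeper
```

The navigator can be reused unchanged with a different bookkeeper, which is the modularity argued for in this section.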

The modularity of the presentation is increased since now the high-level description is an independent piece of software. Besides the important advantage of clarity, we would like to point out the following possible advantage: in many cases a first version of an algorithm is improved later by polishing its low-level implementation (bookkeeping) only. It is desirable in such cases to reuse the previous fault-free high-level specification. If it is written as an independent unit it can be used as is.

[K]  mentions  the  methodological  importance  of  understanding 
several  coroutines  as  (symmetric)  equal  partners,  unlike  a  master-slave 
relation  between  a  program  and  a  subroutine  it  calls.  The  above 
hierarchical  description  may  well  fit  this  understanding  by  naming  a 
new  general  manager  who  employs  all  coroutines  involved.  So  no 
coroutine  dominates  another. 

Examples 

Example  I.  Choice  of  a  model  of  parallel  computation. 

The paper [V2] deals with the problem of choosing a theoretical abstract model of parallel computation to be simulated by synchronous distributed machines. The principle applied is to choose the most permissive model of parallel computation, as long as the cost in computational resources does not increase.

We  sketch  the  main  ideas  in  this  paper  emphasizing  the  ones  that 
relate  to  lucid-box  compositions. 

Our machine is assumed to be represented by a model of synchronous distributed computations (SDC). It employs a sequence of RAMs (see [AHU]) $P_1, P_2, \ldots, P_d$ that operate synchronously in parallel. Each processor can communicate directly with no more than $c$ other processors (where $c$ is some "small" constant) through communication lines. Communication registers which are associated with the lines are used for the communication.

The concurrent-read exclusive-write parallel RAM (CREW PRAM) is a synchronous model of parallel computation in which all $p$ processors $P_1, \ldots, P_p$ have access to a shared memory. Simultaneous access to the same common memory location is allowed for read (but not write) purposes. The Fetch-and-Add (F&A) PRAM model of synchronous parallel computation allows every operation which is permitted by the CREW PRAM. In addition, the following is allowed. Let $A$ be a common memory address and let $e_i$ be some address in the local memory of processor $P_i$. We define the F&A instruction as follows. If processor $P_i$ performs F&A$(A, e_i)$ and no other processor performs at the same time an instruction that relates to address $A$, then the content of $A$ is transmitted to processor $P_i$ and stored in one of its local registers, and address $A$ is assigned $A + e_i$. Suppose that several processors simultaneously perform F&A instructions that relate to $A$. The result is defined to be as if these instructions were performed serially in some order.
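The serialization semantics just defined can be made concrete by a minimal sketch. It assumes, for illustration only, that the common memory is a Python dictionary and that the "some order" of the definition is the order of the request list; the function name `fetch_and_add_pulse` is hypothetical.

```python
def fetch_and_add_pulse(memory, requests):
    """Serve simultaneous F&A requests as if they were performed serially.

    memory   : dict mapping common memory addresses to their contents
    requests : list of (processor, address, increment) triples; the list order
               is the (assumed) serial order
    Returns a dict mapping each processor to the value it fetched.
    """
    fetched = {}
    for processor, address, increment in requests:
        fetched[processor] = memory[address]  # old content goes to the processor
        memory[address] += increment          # address is assigned A + e_i
    return fetched
```

For three requests with increments 3, 5 and 1 on an address holding 10, the processors fetch 10, 13 and 18 and the address ends up holding 19.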

Generally  speaking  the  main  result  of  the  paper  is  that  every 
simulation  of  algorithms  given  in  the  CREW  PRAM  into  the  SDC  has  a 
counterpart  simulation  for  algorithms  given  in  the  F&A  PRAM  into  the 
SDC  which  requires  the  same  SDC-time  and  SDC-local  memories  up  to  a 
constant  factor.  This  supports  the  choice  of  the  F&A  PRAM  as  an 
abstract  model  of  parallel  computation  as  was  done  for  the 
NYU-Ultracomputer  [GGKMRS]. 

We sketch the main ideas that relate to lucid-box compositions. Let us take a pulse of the F&A PRAM. Assume that in this pulse processors $i_1, i_2, \ldots, i_k$ want to execute F&A$(A_1, e_1)$, F&A$(A_2, e_2)$, ..., F&A$(A_k, e_k)$, respectively, where $A_j$ is a common memory address and $e_j$ is some address in the local memory of processor $i_j$, for $1 \le j \le k$. All other processors of the F&A PRAM are assumed to remain idle during the present pulse. We have now arrived at the point where the present example relates to lucid-box compositions. We 'replace' the given pulse by the following: every F&A$(A_j, e_j)$ instruction for processor $i_j$ is replaced by the instruction "read common address $A_j$ into local address $e_j$ of processor $i_j$", for $1 \le j \le k$. We apply the simulation of the CREW PRAM into the SDC in order to simulate this 'reading' pulse. This simulation is done in auxiliary memory locations of the SDC, and all cases where the contents of a (memory or communication) location of the SDC is copied into another are recorded in the local memory of the processor that does it. Let $\bar{A}_j$ (resp. $\bar{e}_j$) be a memory location of the SDC which simulates memory location $A_j$ (resp. $e_j$). We assume that there is exactly one such $\bar{A}_j$ (resp. $\bar{e}_j$). The correctness of the simulation implies that the contents of location $A_j$ is propagated from $\bar{A}_j$ to $\bar{e}_j$ between the time the simulation of the present reading pulse begins and the time it ends. Say that it takes $T$ cycles of the SDC machine. We assume that this propagation occurs by repeatedly copying the contents of $A_j$ from one SDC memory location to another until it is finally copied into $\bar{e}_j$. (No splitting of the bits of this contents and no encoding of the contents of several locations into one are allowed.) This is called the copying assumption.

Let $A$ be the set of all memory locations (in all local memories) of the SDC. The following (layered) directed graph $G(V,E)$ is introduced as a tool for specifying the synchronization of the remaining steps; edges between two successive layers represent simultaneous events that take place between two successive ticks of our 'clock'. The set of vertices of $G$ is $V = \{(a,t) \mid a \in A,\ 0 \le t \le T\}$. Layer $t$ of $G$ is $L_t = \{(a,t) \mid a \in A\}$, for $0 \le t \le T$. Let

$$E_1 = \{[(a,t-1),(b,t)] \mid \text{an (SDC) processor } P_i, \text{ for some } 1 \le i \le s, \text{ copied the contents of address } a \text{ into address } b \text{ at time unit } t,\ 1 \le t \le T\}$$

$$E_2 = \{[(a,t-1),(a,t)] \mid \text{no processor wrote into address } a \text{ at time unit } t,\ 1 \le t \le T\}$$

The set of edges $E$ is the union of $E_1$ and $E_2$. Note that all edges of $E_1$ were actually recorded at the local memories of the corresponding processors.
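The construction of the layered graph from the recorded copies can be sketched as follows. The sketch makes assumptions that are ours: the records are given as triples `(t, a, b)` meaning "at time unit t the contents of a was copied into b", copies are the only writes, and the helper names `build_layered_graph` and `reaches` are hypothetical.

```python
from collections import defaultdict, deque


def build_layered_graph(locations, copy_records, T):
    """Return the edge set of the layered graph as pairs of (location, time) vertices."""
    # Copy edges: one edge per recorded copy, from layer t-1 to layer t.
    e1 = {((a, t - 1), (b, t)) for (t, a, b) in copy_records}
    # Persistence edges: a location nobody wrote into keeps its contents.
    written = {(b, t) for (t, a, b) in copy_records}
    e2 = {((a, t - 1), (a, t))
          for a in locations
          for t in range(1, T + 1)
          if (a, t) not in written}
    return e1 | e2


def reaches(edges, source, target):
    """Breadth-first search: is there a directed path from source to target?"""
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in successors[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```

Under the copying assumption, a directed path from the vertex of the simulated common address at layer 0 to the vertex of the simulated local address at layer T witnesses how the value was propagated.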

It is not difficult to observe that binary trees with the SDC locations $\bar{A}_j$ (simulating $A_j$) as roots and the locations $\bar{e}_j$ (simulating $e_j$) as leaves are subgraphs of $G$. The simulation of the F&A instructions proceeds by identifying these trees and implementing a synchronous partial-sums computation on them in order to satisfy the F&A instructions. The following facts follow readily: the time required for the simulation is $O(T)$; and, if processors $P_1, P_2, \ldots, P_d$ employed local memories of sizes $m_1, m_2, \ldots, m_d$, respectively, for the simulation of the reading pulse, then they employ local memories of sizes $O(m_1+T), O(m_2+T), \ldots, O(m_d+T)$, respectively, for the simulation of the pulse that involved the F&A instructions.

The refined correspondence between many performance parameters of an existing procedure and corresponding parameters of a new procedure seems to be relevant for many computational environments. For instance, an algorithm for a synchronous parallel shared memory model of computation may be evaluated by its number of steps, the size of each of its local and common memories, the number of accesses to the shared memory and their frequency, and many more parameters. Another example is algorithms for synchronous distributed machines. These may be evaluated by the communication load on each line, the sizes of local memories and the number of steps.

The composition of the simulation of the pulse that involves F&A instructions using the execution of the simulation of the reading pulse gave us a many-parameter correspondence between the two simulations. The notion of lucid-box compositions allows for a closer adherence to the existing procedure and enables us to use information which one cannot expect to find in typical predeclared outputs. Therefore, we claim that lucid-box compositions are tailored for multi-parameter optimization of algorithms.


Example  II.  A  Parallel-Design  Distributed-Implementation  (PDDI) 
General-Purpose  Computer. 

The paper [V1] introduces a scheme of an efficient general-purpose parallel computer. Its design space (i.e. the model for which parallel programs are written) is the F&A PRAM. The implementation space is presented as a scheme of an SDC in which each processor may communicate with $\le 4$ others. We sketch below how lucid-box compositions may help in the construction of an efficient translation of the design space into the implementation space.

The  F&A  PRAM  employs  p  processors  l,2,...,p  and  m  common  memory 
addresses  l,2,...,m. 

The SDC consists of a sorting network followed by a merging network as in Figure 1. Any sorting and merging network can be used. The SDC employs $d$ ($\le p$) strong processors (called super-processors); each of them is responsible for simulating the behavior of about $p/d$ F&A PRAM processors. It also employs 'comparator processors' that behave similarly to comparator modules of comparison networks, and $m$ 'memory processors', each of which is responsible for simulating the behavior of one common memory location. We present the solution only for a pulse of the F&A PRAM of the following form. Assume that processors $i_1, i_2, \ldots, i_k$ want to execute F&A$(A_1, e_1)$, F&A$(A_2, e_2)$, ..., F&A$(A_k, e_k)$, respectively, where $A_j$ is a common memory address and $e_j$ is an address of the local memory of processor $i_j$, $1 \le j \le k$. All other processors of the F&A PRAM remain idle during the present pulse.

The translation into the SDC proceeds in cycles. At the $j$-th cycle super-processor $i$ simulates the behavior of processor $i + (j-1)d$, $1 \le i \le d$, $1 \le j \le \lceil p/d \rceil$. Let us describe now the first cycle. It consists of four steps. The F&A instructions of processors $i_j$ such that $1 \le i_j \le d$, for $1 \le j \le k$, are being simulated. Denote these processors by $i_1, i_2, \ldots, i_{k_1}$, for some $k_1$.

Step 1. By the sorting network of the SDC, we sort the pairs $(A_j, i_j)$, for $j = 1, \ldots, k_1$, according to a lexicographic order. Comparator processors serve as comparator modules. They also transmit the contents of the $e_j$ cells and keep records of their activities at each time unit of the sorting. The output sorted list contains, in successive locations, F&A instructions that relate to the same common memory location.
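The effect of Step 1 can be sketched in a few lines. This is a simplification of ours: Python's built-in `sorted` stands in for the sorting network, and the "records" are reduced to the permutation that the sort applied, rather than per-comparator logs. The names `sort_requests` and `route_back` are hypothetical.

```python
def sort_requests(requests):
    """Sort F&A requests lexicographically by (address, processor).

    requests : list of (address, processor, increment) triples
    Returns the sorted list and a record of where each output came from.
    """
    order = sorted(range(len(requests)),
                   key=lambda j: (requests[j][0], requests[j][1]))
    sorted_requests = [requests[j] for j in order]
    return sorted_requests, order


def route_back(results_in_sorted_order, order):
    """Use the kept record to return each result to its original position."""
    answers = [None] * len(order)
    for position, j in enumerate(order):
        answers[j] = results_in_sorted_order[position]
    return answers
```

After sorting, requests that relate to the same common memory address occupy successive positions, and the kept record suffices to send answers back to where the requests came from.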

The  records  which  are  kept  through  the  sorting  algorithm  will  be 
used  later  in  order  to  send  back  to  the  super-processor  messages  which 
correspond  to  ones  that  were  forwarded  through  the  sorting  network. 
Here,  we  already  observe  some  form  of  lucid-box  composition  where 
intermediate  results  of  a  (not  fully  specified)  sorting  network  are 
being  used  for  another  procedure.  However,  the  more  interesting  use  of 
lucid-box  compositions  is  being  made  in  the  next  two  steps. 

Step 2. This output sorted list and the (sorted) list of memory addresses are merged by the merging network of the SDC. (A comparison of a pair $(A_j, i_j)$ and a memory location $f$ is defined as follows: $(A_j, i_j) > f$ if $A_j > f$.) Comparator processors serve as comparator modules. They also transmit further the contents of the $e_j$ cells and keep records of their activities at each time unit of the merging. The interesting point is that we are not interested in the result of the merging itself. For each common memory address $A_j$ such that an F&A instruction which relates to it is being simulated in the first cycle, let us define the following directed graph. Each line of the merging network that transmits either the address $A_j$ itself or an F&A instruction that relates to it corresponds to an edge of this digraph. The edge gets the direction of this transmission. Each end point of the edge is a vertex of the digraph. The paper proves the following general property of merging networks.

Property of merging networks. The digraph of $A_j$ contains the following rooted tree as a subgraph (this rooted tree lies on its side in Fig. 1):

(1) The memory processor corresponding to address $A_j$ ($M_{A_j}$ in Fig. 1) is the root of the tree.

(2) All outputs of the merging network that receive an F&A instruction which relates to $A_j$ are leaves of this tree.

At this point we would like to interrupt the description of the simulation and relate the simulation to lucid-box compositions. A lucid-box composition (in the wide sense of distributed computation) using sorting and merging networks is exercised in the design of both the architecture of the SDC and its operation. The aforementioned property of merging networks enables us to use a merging network in a particularly intriguing way that takes advantage of intermediate results that must be achieved by any merging network and does not care about the output of the merging. Apparently, none of the designers of merging networks expected their solutions to be used for anything other than their predeclared outputs. Lucid-box compositions enable and encourage awareness of the possibility of utilizing non-predeclared outputs of existing procedures.

Step 3. The partial sums needed for the simulation of the F&A instructions are computed by moving synchronously from the leaves of each tree (that was described above) to its root and back to the leaves. This corresponds to a move from right to left in Figure 1 followed by a move from left to right.
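The two sweeps of Step 3 can be sketched on a single tree. The representation is an assumption of ours: a leaf is an integer increment, an internal node is a list of children, and the left-to-right order of the leaves plays the role of the serial order of the F&A instructions. The function names are hypothetical, and the sketch is sequential rather than synchronous.

```python
def upward(node):
    """Leaves-to-root sweep: annotate every node with the sum of its subtree."""
    if isinstance(node, int):
        return node, node  # (subtree total, leaf increment)
    children = [upward(child) for child in node]
    return sum(total for total, _ in children), children


def downward(annotated, prefix, fetched):
    """Root-to-leaves sweep: hand every leaf the value it should fetch."""
    _, body = annotated
    if isinstance(body, int):
        fetched.append(prefix)
        return
    for child in body:
        downward(child, prefix, fetched)
        prefix += child[0]  # skip over everything added by this subtree


def simulate_fetch_and_add(memory_value, tree):
    """Return the values fetched by the leaves and the new memory content."""
    annotated = upward(tree)
    fetched = []
    downward(annotated, memory_value, fetched)
    return fetched, memory_value + annotated[0]
```

For a memory content of 10 and the tree `[[3, 5], [1]]`, the leaves fetch 10, 13 and 18 and the memory content becomes 19, exactly as if the three instructions had been performed serially.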

Step 4. The partial sums are sent through the merging and sorting networks back to the super-processors.

Each  of  the  following  cycles  is  being  processed  in  the  same  way. 

Pipelining. Several cycles may have F&A instructions that relate to the same common memory address. Each of the trees that corresponds to a specific common memory address in various cycles is rooted at its memory-processor. This enables pipelining of cycles with constant time delays between successive ones. This memory-processor serves as the link in the chain between successive cycles that relate to the common memory location. The paper elaborates more on this point. The following theorem can be finally stated about the PDDI General Purpose Computer. Let $f(s,m)$ (resp. $\ell(s,m)$) be the sum of the number of comparator processors (resp. the lengths of the longest directed paths) in a selected sorting network of $s$ elements and a selected merging network of lists of length $s$ and $m$.

Theorem. Given an algorithm with time $O(t/p)$ for all $p \le x$ in an F&A PRAM with $p$ processors and $m$ common memory locations, where $t$, $x$ and $m$ are some numbers, we can simulate it in an SDC with $s$ super-processors, $m$ memory-processors and $f(s,m)$ comparator-processors in time $O(t/s)$ for $s \le x/\ell(s,m)$.


Remark. For Batcher's sorting and merging networks we get $\ell(s,m) = O(\log^2 s + \log m)$ and $f(s,m) = O(s \log^2 s + m \log m)$.

Replacing this sorting network by the one suggested in [AKS] results in replacing $\log^2 s$ by $\log s$, which is very favorable. However, large constants will have to be taken into account in this case.
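To make the two quantities concrete for one choice of sorting network, the following sketch generates Batcher's odd-even merge sorting network and measures its number of comparators and its depth. It is only an illustration under assumptions of ours: the number of inputs is a power of two, the recursive formulation is the standard textbook one and not taken from the paper, and the function names are hypothetical.

```python
def oddeven_merge(lo, hi, r):
    """Comparators merging the sorted halves of wires lo..hi (inclusive), stride r."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Comparators sorting wires lo..hi (inclusive); hi - lo + 1 must be a power of 2."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def network_depth(comparators, n):
    """Length of the longest directed path, scheduling each comparator as early as possible."""
    level = [0] * n
    for i, j in comparators:
        level[i] = level[j] = max(level[i], level[j]) + 1
    return max(level)


def apply_network(comparators, values):
    """Run the comparator network on a list of values."""
    values = list(values)
    for i, j in comparators:
        if values[i] > values[j]:
            values[i], values[j] = values[j], values[i]
    return values
```

For 8 inputs this yields 19 comparators and depth 6, in line with the $s \log^2 s$ and $\log^2 s$ growth quoted in the remark.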

This result compares favorably with other related results. It employs fewer auxiliary processors than [Ec], solves a wider problem than [GP], and improves on each of these papers for the respective problems that they solve. The proximity of our machine to sorting and merging networks may enable us to take advantage of the richness of accomplishments regarding layouts and other implementation notions of such networks in various technologies.

Example  III.   Parametrized  Computing 

The topic of Parametrized Computing was initiated by N. Megiddo in [Meg1]. We will not elaborate here on Parametrized Computing much beyond the example given in the introduction. The reader is invited to check any of our declarations regarding Parametrized Computing on this example. An abstract example which is similar to some extent to the one given below can be found in [Meg2]. Suppose that $F(X)$ is a monotone function of the real variable $X$ and problem A is to evaluate $F$ at a given $X$. Suppose that problem B is to solve the equation $F(X) = 0$. Let us restrict ourselves to algorithms for problem A that satisfy the following: throughout their execution the variable $X$ is involved only in additions, comparisons and multiplications by constants. It is typical of Parametrized Computing to construct an efficient algorithm for problem B by a lucid-box composition that employs an efficient algorithm for problem A. The specification of this construction does not typically need a full specification of the algorithm for problem A. The correspondence between the efficiency criteria for the (hypothetical) algorithm for A and the efficiency criteria for the algorithm for B is sometimes simple. For instance, the example which is given in the introduction applies the same efficiency criteria (time or space complexity) in order to measure the performance of both the median algorithm and the 'parametrized median' algorithm. However, [Meg2] proposes a subtler correspondence. In some cases a fast parallel algorithm for A that uses a small number of processors implies a fast sequential algorithm for B.
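A toy instance of this construction may help. The choices below are ours and are not taken from [Meg1] or [Meg2]: $F(X) = \max_i (a_i X + b_i)$ with all $a_i > 0$, so that $F$ is strictly increasing; the algorithm for problem A is a plain left-to-right maximum; and the names `evaluate_F` and `solve_F_equals_zero` are hypothetical. The algorithm for problem B runs the algorithm for A on the unknown root $X^*$, and whenever a comparison depends on $X^*$ it is resolved by one ordinary evaluation of $F$ at the critical point.

```python
from fractions import Fraction


def evaluate_F(lines, x):
    """Problem A: F(x) = max of a*x + b over all (a, b) in lines."""
    best = None
    for a, b in lines:
        value = a * x + b  # x is only multiplied by a constant and added to
        if best is None or value > best:
            best = value
    return best


def solve_F_equals_zero(lines):
    """Problem B: the root X* of F, found by running the algorithm for A at X*."""
    lo = hi = None  # X* is known to lie in the open interval (lo, hi)
    best = lines[0]
    for cand in lines[1:]:
        (a1, b1), (a2, b2) = best, cand
        if a1 == a2:  # parallel lines: the comparison does not depend on X*
            if b2 > b1:
                best = cand
            continue
        x0 = Fraction(b1 - b2, a2 - a1)  # the two lines cross at x0
        if lo is not None and x0 <= lo:
            root_is_right = True
        elif hi is not None and x0 >= hi:
            root_is_right = False
        else:
            value = evaluate_F(lines, x0)  # one call to the algorithm for A
            if value == 0:
                return x0
            root_is_right = value < 0  # F is increasing
            if root_is_right:
                lo = x0
            else:
                hi = x0
        # To the right of x0 the steeper line is larger, to the left the flatter.
        if (a2 > a1) == root_is_right:
            best = cand
    a, b = best  # the line attaining the maximum at X*
    return Fraction(-b, a)
```

For the lines $X - 4$, $2X - 10$ and $X - 3$ the sketch returns 3, after a single evaluation of $F$ at the crossing point 6. Only the comparisons of the algorithm for A are inspected, which is the sense in which the composition is lucid-box rather than black-box.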

We finish this section by mentioning more works that applied the notions of Parametrized Computing for solving problems taken from different fields (such as networks, scheduling, location, geometry and statistics): [CD], [Qui], [Gu2], [UN], [L], [Meg4], [Meg5], [MT1] and [MT2]. See [Meg3] for an example of the effectiveness of a repeated application of lucid-box compositions. These works demonstrate many instances where applications of lucid-box compositions improved on previous solutions.